The remarkable abilities of the primate visual system have inspired the construction of 
computational models of some visual neurons. We propose a trainable hierarchical object 
recognition model, which we call S-COSFIRE (S stands for Shape and COSFIRE stands 
for Combination Of Shifted Filter REsponses) and use it to localize and recognize objects 
of interest embedded in complex scenes. It is inspired by the visual processing in the 
ventral stream (V1/V2 → V4 → TEO). Recognition and localization of objects embedded 
in complex scenes is important for many computer vision applications. Most existing 
methods require prior segmentation of the objects from the background, which in 
turn requires recognition. An S-COSFIRE filter is automatically configured to be selective 
for an arrangement of contour-based features that belong to a prototype shape specified 
by an example. The configuration comprises selecting relevant vertex detectors and 
determining certain blur and shift parameters. The response is computed as the weighted 
geometric mean of the blurred and shifted responses of the selected vertex detectors. 
S-COSFIRE filters share similar properties with some neurons in inferotemporal cortex, 
which provided inspiration for this work. We demonstrate the effectiveness of S-COSFIRE 
filters in two applications: letter and keyword spotting in handwritten manuscripts and 
object spotting in complex scenes for the computer vision system of a domestic robot. 
S-COSFIRE filters are effective for recognizing and localizing (deformable) objects in images of 
complex scenes without requiring prior segmentation. They are versatile trainable shape 
detectors, conceptually simple and easy to implement. The presented hierarchical shape 
representation contributes to a better understanding of the brain and to more robust 
computer vision algorithms. 



Keywords: hierarchical representation, object recognition, shape, ventral stream, vision and scene understanding, 
robotics, handwriting analysis 



1. INTRODUCTION 

Shape is perceptually the most important visual characteristic 
of an object. Although there is no formal definition, as with 
most perception-related concepts, it is understood that the two-dimensional 
shape of an object is characterized by the relative 
spatial positions of a collection of contour-based features. 

Let us consider, for instance, the square in Figure 1A, which 
we refer to as a reference or prototype object. From the point 
of view of visual perception, the incomplete object in Figure 1B 
is very similar to the prototype even though it is composed of 
only 25% of the contour pixels of the reference object. In 
contrast, the closed polygon in Figure 1C, whose bottom 
half is equivalent to that of the prototype, is perceptually less 
similar to it. Furthermore, there is little perceptual similarity 
between the prototype and its scrambled contour parts shown in 
Figure 1D. 

As a matter of fact, there is neurophysiological evidence that 
objects, such as faces, are recognized by detecting certain features 
that are spatially arranged in a certain way (Kobatake and Tanaka, 
1994). By means of single-cell recordings in adult monkeys it was, 
for instance, found that a neuron in inferotemporal cortex gives 
similar responses to the two images shown in Figures 2A,B. The 
icon presented in Figure 2B is a simplified version of the monkey's 
face shown in Figure 2A. It consists only of a circle that 
surrounds a horizontally aligned pair of spots on top of a horizontal 
bar. Removing one of these features (Figures 2C,D) causes 
the concerned cell to give a very small response. 

Another neurophysiological study (Brincat and Connor, 2004) 
reveals that some neurons in inferotemporal cortex integrate 
information about the curvatures, orientations, and positions of 
multiple (typically 2-4) simple contour elements, such as angles 
or curved contour segments. In that study the authors argue that 
their findings are in line with other studies that support parts- 
based shape representation theories (Marr and Nishihara, 1978; 
Riesenhuber and Poggio, 1999; Mel and Fiser, 2000; Edelman and 
Intrator, 2003), and suggest that non-linear integration in the 
inferotemporal cortex might help to extend sparseness of shape 
representation along the ventral stream. 

FIGURE 1 | (A) A prototype shape. (B) A test pattern that has only 25% 
similarity (computed by template matching) to the prototype is perceptually 
more similar to the prototype than the polygon in (C) and the set of contour 
parts in (D), both of which have 50% similarity (computed by template 
matching) to the prototype. 



Tsotsos (1990) showed that hierarchical architectures are more 
appropriate for object detection than unbounded visual 
search, which is known to be NP-complete. This has led to the 
proposal of a number of hierarchical models (Mel and Fiser, 
2000; Scalzo and Piater, 2005; DiCarlo and Cox, 2007; Rodriguez- 
Sanchez and Tsotsos, 2012). Existing approaches that consider 
the spatial relationship of features include the so-called standard 
model (Serre et al., 2007), some probabilistic techniques, such 
as the generative constellation model (Fergus et al., 2003; Fei-Fei 
et al., 2007) and a hierarchical model of object categories (Fidler 
and Leonardis, 2007; Fidler et al., 2008). These approaches rely on 
summation of the responses of elementary feature detectors and 
may find the images in Figures 1C,D quite similar to the prototype 
in Figure 1A. For instance, such a technique may consider 
a circle with a horizontal line within it as a face even though the 
representations of the eyes are missing (Figures 2C,D). 

We introduce a hierarchical object detection technique which 
is motivated by the shape selectivity of some neurons in 
inferotemporal cortex. The principal idea is to construct a 
shape-selective filter that combines the responses of some simpler 
filters that detect some partial features of the concerned 
shape in specific positions that are characteristic of that shape. 
We call this approach to the construction of filters Combination 
Of Shifted Filter REsponses (COSFIRE). We successfully applied 
this approach to the construction of line and edge detectors 
(Azzopardi and Petkov, 2012; Azzopardi et al., 2014) and simple 
contour-related features, such as vascular bifurcations (Azzopardi 
and Petkov, 2013b). In Azzopardi and Petkov (2013b) we demonstrated 
how the collective responses of multiple COSFIRE filters 
to segmented patterns, such as handwritten digits, can be used 
to form a shape descriptor with high discrimination ability. That 
descriptor, however, does not take into account the relative spa- 
tial arrangement of the concerned features. Similar to other shape 
descriptors (Belongie et al., 2002; Grigorescu and Petkov, 2003; 
Ghosh and Petkov, 2005; Latecki et al., 2005; Lauer et al., 2007; 
Ling and Jacobs, 2007; Goh, 2008; Almazan et al., 2012) that 
approach works well with segmented objects, but it is not effec- 
tive for the detection of objects embedded in complex scenes. 
In order to distinguish the two types of filter, we refer to the 
composite shape-selective filter that we propose in this paper as 
S-COSFIRE and to the filter proposed in Azzopardi and Petkov 
(2013b) as V-COSFIRE (S and V stand for shape and vertex, 
respectively). 

There are three aspects in which the S-COSFIRE filters that 
we propose differ from other hierarchical models that also con- 
sider the spatial geometric arrangement of parts. First, our model 
is implemented in a filter that gives a scalar response (between 
0 and 1) for each position in the image. The higher the value 
the more similar the shape around the concerned location is to 
the prototype shape. An S-COSFIRE filter can be thought of as a 
model of a shape-selective neuron in inferotemporal cortex of the 
type studied in Kobatake and Tanaka (1994) and Brincat and Connor 
(2004), which fires only when a specific arrangement of contour- 
based features is present in its receptive field. It addresses object 
recognition and localization as a joint problem, which is in line 
with how Marr (1982) defined the sense of seeing: "... to know 
what is where by looking." In contrast, the other methods referred 
to above use multiple prototypes and consider several responses 
from different feature detectors to form a mixture of probability 
distributions or a vector of responses. For these methods, the geo- 
metrical spatial arrangement of the concerned prototype defining 
parts is achieved by training a supervised classifier and subse- 
quently the similarity between a test pattern and a prototype is 
computed by a distance metric. Moreover, they suffer from insuf- 
ficient robustness to localization because they treat this matter at 
a region level (sliding window) rather than at a pixel level. 

Second, since the omission of an object part can radically 
change shape perception, we regard every feature (and its rela- 
tive position) that forms part of a prototype shape as essential. 
This aspect is implemented as an AND-type operation of an 
S-COSFIRE filter. It is in contrast to other models that rely on 
summation, and therefore achieve a response even when any of 
the prototype-defining features is missing. These models may 
thus match objects that are perceptually different. 

Third, while the S-COSFIRE approach that we present achieves 
invariance to rotation, scaling, and reflection by simply manip- 
ulating some model parameters, the other techniques can only 
achieve invariance to such geometric transformations by extend- 
ing the training set with example objects that are rotated, scaled 
and/or reflected versions of a prototype. 

The rest of the paper is organized as follows: in section 
2 we present the proposed hierarchical S-COSFIRE model. In 




FIGURE 2 | (A-D) A set of stimuli used in an electrophysiological study 
(Kobatake and Tanaka, 1994) to test the selectivity of a neuron in 
inferotemporal cortex. (Bottom) The activity of the concerned neuron for 
the corresponding stimuli. The neuron gives a high response only when the 
stimulus contains a detailed or simplified representation of the face 
boundary that surrounds a pair of eyes on top of a mouth. If any of these 
features is missing, the neuron gives a negligible response. 

section 3, we demonstrate its effectiveness in two applications: 
keyword spotting in handwritten manuscripts and vision for 
a home tidying pickup robot. Section 4 contains a discussion 
of the properties of the S-COSFIRE filters and finally we draw 
conclusions in section 5. 

2. METHODS 

The following example illustrates the main idea of the proposed 
method. We consider the triangle, shown in Figure 3A, as a shape 
of interest and call it the prototype. We use this prototype to 
automatically configure an S-COSFIRE filter that will respond to 
shapes that are identical with or similar to this prototype. 

A shape-selective S-COSFIRE filter takes input from simpler 
filters; here, filters that are selective for vertices. We use vertex-selective 
COSFIRE filters of the type proposed in Azzopardi and 
Petkov (2013b) to detect the vertices of the prototype shape. 
Such a filter, which we refer to as a V-COSFIRE filter, combines the 
responses of line detectors, the areas of support of which are 
indicated by the small ellipses in Figure 3A. 

The response of an S-COSFIRE filter is computed by combining 
the responses of the concerned V-COSFIRE filters at the 
centers of the corresponding circles by weighted geometric mean. 
The preferred orientations and the preferred apertures of these 
filters, together with the locations at which we take their responses, 
are determined by analysing the responses of a set of V-COSFIRE 
filters to the prototype shape. Consequently, the S-COSFIRE filter 
will be selective for the given spatial arrangement of vertices 
of specific orientations and apertures. Taking the responses of 
V-COSFIRE filters at different locations around a point can be 
implemented by shifting the responses appropriately before using 
them for the pixel-wise evaluation of a multivariate function 
which gives the S-COSFIRE filter output. 

2.1. DETECTION OF VERTEX FEATURES BY V-COSFIRE FILTERS 

We denote by $r_{V_{f_i}}(x, y)$ the response of a V-COSFIRE filter $V_{f_i}$ 
that is selective for a vertex $f_i$. We threshold these responses 
at a given fraction $t_1$ ($0 \le t_1 \le 1$) of the maximum response 
across all image coordinates $(x, y)$ and denote these thresholded 
responses by $|r_{V_{f_i}}(x, y)|_{t_1}$. We use the publicly available Matlab 
implementation¹ of V-COSFIRE filters. Such a filter uses as input 
the responses of given channels of a bank² of Gabor filters. For further 
technical details about the properties of COSFIRE filters 
we refer to Azzopardi and Petkov (2013b). 

We use a bank of V-COSFIRE filters that are selective for vertices 
of different orientations (in intervals of $\pi/6$ radians) and 
different apertures (in intervals of $\pi/6$ radians) (Figure 3B). For 
the considered prototype the strongest responses are obtained by 
three V-COSFIRE filters that are selective for vertices of the types 
$f_{13}$, $f_{17}$, and $f_{21}$, shown in Figure 3B. The corresponding locations, 
$(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, at which they obtain the maximum 
responses are indicated in Figure 3C. 

2.2. CONFIGURATION OF AN S-COSFIRE FILTER 

An S-COSFIRE filter uses as input the responses of selected V-COSFIRE 
filters $V_{f_i}$, $i = 1 \dots n_V$, each selective for some vertex 
$f_i$, around a certain position $(\rho_i, \phi_i)$ with respect to the center 
of the S-COSFIRE filter. A 3-tuple $(V_{f_i}, \rho_i, \phi_i)$ that consists of a 
V-COSFIRE filter specification $V_{f_i}$ and two scalar values $(\rho_i, \phi_i)$ 
characterizes the properties of a vertex that is present in the given 
prototype shape: $V_{f_i}$ represents a V-COSFIRE filter that is selective 
for a vertex $f_i$, and $(\rho_i, \phi_i)$ are the polar coordinates of the 
location at which its response is taken with respect to the center 
of the S-COSFIRE filter. In the following we explain how we 
obtain the parameter values of such vertices around a given point 
of interest. 

For each location in the input image of the prototype shape 
we take the maximum value of all responses achieved by the bank 
of V-COSFIRE filters mentioned above. The positions that have 
values greater than those of their corresponding 8-neighbors are 
chosen as the points that have local maximum responses. For each 
such point $(x_i, y_i)$ we determine the polar coordinates $(\rho_i, \phi_i)$ 
with respect to the center of the S-COSFIRE filter (Figure 3C). 
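
The local-maximum selection and polar conversion described above can be sketched as follows. This is a minimal illustration in Python, not the Matlab implementation referred to in the text: the function name `local_maxima_polar`, the list-of-rows image layout, the skipping of border pixels, and the convention that the polar angle is measured counter-clockwise with the image y-axis pointing down are all assumptions made for this sketch.

```python
import math

def local_maxima_polar(resp, center):
    """Find strict 8-neighbour local maxima and express them in polar form.

    resp   -- 2-D list of rows, resp[y][x], holding the per-pixel maximum
              over the bank of vertex-detector responses
    center -- (xc, yc), the point of interest (hypothetical input)
    Returns a list of (x, y, rho, phi). Border pixels are skipped because
    they do not have a full 8-neighbourhood (an assumption of this sketch).
    """
    h, w = len(resp), len(resp[0])
    xc, yc = center
    found = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = resp[y][x]
            neighbours = [resp[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dx, dy) != (0, 0)]
            if v > max(neighbours):
                rho = math.hypot(x - xc, y - yc)
                # Negate dy so that "up" in the image is a positive angle.
                phi = math.atan2(yc - y, x - xc) % (2 * math.pi)
                found.append((x, y, rho, phi))
    return found
```

For example, a single peak located two pixels above the chosen center yields one tuple with $\rho = 2$ and $\phi = \pi/2$.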



¹The Matlab implementation of a V-COSFIRE filter can be downloaded from 
http://matlabserver.cs.rug.nl/ 

²Here we use a bank of Gabor filters with five wavelengths 
$\lambda = \{4, 4\sqrt{2}, 8, 8\sqrt{2}, 16\}$ and six equidistant orientations 
$\theta \in \{0, \pi/6, \pi/3, \pi/2, 2\pi/3, 5\pi/6\}$. 




FIGURE 3 | (A) The triangle is the prototype shape of interest and the "+" 
marker indicates the center of the user-specified large circle. The small 
circles indicate the supports of three vertex detectors that are identified as 
relevant for the concerned prototype shape. The small ellipses represent the 
supports of line detectors that are selective for the contour parts of the 
corresponding vertices. (B) A data set of 60 synthetic vertices, $f_1, \dots, f_{60}$ 
(left-to-right, top-to-bottom). A V-COSFIRE filter $V_{f_i}$ is selective for a vertex 
$f_i$. (C) Configuration of an S-COSFIRE filter. The "x" markers indicate the 
locations, $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, where the corresponding three 
V-COSFIRE filters, $V_{f_1}$, $V_{f_2}$, $V_{f_3}$, achieve the maximum responses. These 
locations correspond to the three vertices of the prototype shape, which is 
rendered here with low contrast. The Cartesian coordinates of each point 
$(x_i, y_i)$ are converted into the polar coordinates $(\rho_i, \phi_i)$ with respect to the 
given point of interest $(x', y')$, indicated by the "+" marker. 

Then we determine the V-COSFIRE filters, the responses of 
which are greater than a fraction $t_2 = 0.75$ of the maximum 
response $r_{V_{f_i}}(x, y)$ for all $i \in \{1, \dots, n_f\}$, where $n_f$ is the number 
of V-COSFIRE filters used, across all locations in the input 
image. Thus, multiple V-COSFIRE filters can be significantly 
activated for the same location $(\rho_i, \phi_i)$. The selected points characterize 
the dominant vertices in the given prototype shape of 
interest. 
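
The filter-selection rule can be illustrated with a short Python sketch. The function name `select_filters` and its argument layout are hypothetical; the sketch only assumes that the responses of all filters at one local-maximum location and the global maximum response (over all filters and all locations) are already available.

```python
def select_filters(responses_at_point, global_max, t2=0.75):
    """Return the indices of filters that are significantly activated.

    responses_at_point -- responses of every filter in the bank at one
                          local-maximum location (hypothetical input)
    global_max         -- maximum response over all filters and locations
    t2                 -- fraction of the global maximum used as threshold
    """
    threshold = t2 * global_max
    return [i for i, r in enumerate(responses_at_point) if r > threshold]
```

Because the comparison is made per filter, more than one index may be returned for the same location, which corresponds to several tuples sharing the same polar coordinates.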

We denote by $S_S = \{(V_{f_i}, \rho_i, \phi_i) \mid i = 1 \dots n_V\}$ the set of 
parameter value combinations, which describes the properties 
and locations of a number of vertices. The subscript $S$ 
stands for the prototype shape of interest. Every tuple in set 
$S_S$ specifies the parameters of some vertex in prototype $S$. 
For the prototype shape of interest in Figure 3A, the selection 
method described above results in three vertices with 
parameter values specified by the tuples in the following 
set: $S_S = \{(V_{f_1 = 21}, \rho_1 = 50, \phi_1 = \pi/2), (V_{f_2}, \rho_2 = 50, \phi_2 = 7\pi/6), (V_{f_3}, \rho_3 = 50, \phi_3 = 5\pi/3)\}$. 

2.3. BLURRING AND SHIFTING V-COSFIRE RESPONSES 

The above configuration results in an S-COSFIRE filter that is 
selective for a preferred spatial arrangement of three vertices 
forming an equilateral triangle. Next, we use the responses of 
the V-COSFIRE filters that are selective for the corresponding 
vertices to compute the output of the S-COSFIRE filter as 
follows. 

First, we blur the responses of the V-COSFIRE filters in order 
to allow for some tolerance in the position of the respective vertices. 
This increases the generalization ability of the S-COSFIRE 
filter under construction. We define the blurring operation as 
the computation of the maximum value of the weighted thresholded 
responses of a V-COSFIRE filter. For weighting we use a Gaussian 
function $G_\sigma(x, y)$, the standard deviation $\sigma$ of which is a linear 
function of the distance $\rho$ from the center of the S-COSFIRE filter: 
$\sigma = \sigma_0 + \alpha\rho$, where $\sigma_0$ and $\alpha$ are constants. The choice of 
this linear function is inspired by the visual system of the brain, 
for which we provide more detail in section 4. For $\alpha > 0$, which 
we use, the tolerance to the position of the respective vertices 
increases with an increasing distance $\rho$ from the support center 
of the concerned S-COSFIRE filter. 

Second, we shift the blurred responses of each V-COSFIRE 
filter by a distance $\rho_i$ in the direction opposite to $\phi_i$. With this 
shifting, the concerned V-COSFIRE filter responses, which are 
located at different positions $(\rho_i, \phi_i)$, meet at the support center 
of the S-COSFIRE filter. The output of the S-COSFIRE filter 
can then be evaluated as a pixel-wise multivariate function of the 
shifted and blurred V-COSFIRE filter responses. In 
polar coordinates, the shift vector is specified by $(\rho_i, \phi_i + \pi)$, and 
in Cartesian coordinates, it is $(\Delta x_i, \Delta y_i)$ where $\Delta x_i = -\rho_i \cos\phi_i$ 
and $\Delta y_i = -\rho_i \sin\phi_i$. We denote by $s_{V_{f_i},\rho_i,\phi_i}(x, y)$ the blurred 
and shifted thresholded response of a V-COSFIRE filter that is 
specified by the i-th tuple $(V_{f_i}, \rho_i, \phi_i)$ in the set $S_S$: 

$$s_{V_{f_i},\rho_i,\phi_i}(x, y) = \max_{x', y'} \left\{ \left| r_{V_{f_i}}(x - x' - \Delta x_i,\; y - y' - \Delta y_i) \right|_{t_1} G_\sigma(x', y') \right\}, \quad \text{where } -3\sigma \le x', y' \le 3\sigma \qquad (1)$$
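
To make the blur-and-shift step concrete, the following Python sketch evaluates the blurred and shifted response at a single pixel. It is an illustration under stated assumptions rather than the reference implementation: the function names are hypothetical, the Gaussian is taken unnormalized (peak value 1), the shift is rounded to whole pixels, plain Cartesian axes are used (y grows with the row index), and positions outside the image contribute nothing.

```python
import math

def shift_vector(rho, phi):
    """Cartesian shift (dx, dy) = (-rho*cos(phi), -rho*sin(phi))."""
    return (-rho * math.cos(phi), -rho * math.sin(phi))

def blurred_shifted(resp, x, y, rho, phi, sigma0=0.67, alpha=0.1):
    """Blurred and shifted response at pixel (x, y) for one tuple.

    resp is a 2-D list of rows, resp[y][x], holding an already thresholded
    vertex-detector response. sigma grows linearly with rho.
    """
    sigma = sigma0 + alpha * rho
    dx, dy = shift_vector(rho, phi)
    dx, dy = round(dx), round(dy)          # whole-pixel shift (assumption)
    radius = int(math.ceil(3 * sigma))     # search window of +/- 3 sigma
    h, w = len(resp), len(resp[0])
    best = 0.0
    for yp in range(-radius, radius + 1):
        for xp in range(-radius, radius + 1):
            xs, ys = x - xp - dx, y - yp - dy
            if 0 <= xs < w and 0 <= ys < h:
                g = math.exp(-(xp * xp + yp * yp) / (2 * sigma * sigma))
                best = max(best, resp[ys][xs] * g)
    return best
```

With a single unit response located three pixels to the right of a center, the tuple $(\rho = 3, \phi = 0)$ brings that response exactly onto the center, while neighboring pixels receive a Gaussian-attenuated value.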



Figure 4 illustrates the blurring and shifting operations 
for this S-COSFIRE filter, applied to the image shown in 
Figure 3A. 

We define the response $r_{S_S}(x, y)$ of an S-COSFIRE filter as the 
weighted geometric mean of the blurred and shifted thresholded 
responses of the selected V-COSFIRE filters $s_{V_{f_i},\rho_i,\phi_i}(x, y)$: 

$$r_{S_S}(x, y) = \left| \left( \prod_{i=1}^{|S_S|} \left( s_{V_{f_i},\rho_i,\phi_i}(x, y) \right)^{\omega_i} \right)^{1 / \sum_{i=1}^{|S_S|} \omega_i} \right|_{t_3}, \quad \omega_i = \exp\left( -\frac{\rho_i^2}{2\sigma'^2} \right), \quad 0 \le t_3 \le 1 \qquad (2)$$

where $|\cdot|_{t_3}$ stands for thresholding the response at a fraction $t_3$ 
of its maximum across all image coordinates $(x, y)$. For $1/\sigma' = 0$, 
the computation of the S-COSFIRE filter response is equivalent to the 
standard geometric mean, where the s-quantities have the same 
contribution. Otherwise, for $1/\sigma' > 0$, the input contribution of 
s-quantities decreases with an increasing value of the corresponding 
parameter $\rho$. In our experiments we use a value of the standard 
deviation $\sigma'$ that is computed as a function of the maximum 
value of the given set of $\rho$ values: $\sigma' = (-\rho_{\max}^2 / (2 \ln 0.5))^{1/2}$, 
where $\rho_{\max} = \max_{i \in \{1 \dots |S_S|\}} \{\rho_i\}$. We make this choice in order 
to achieve a maximum value $\omega = 1$ of the weights in the center 
(for $\rho = 0$), and a minimum value $\omega = 0.5$ in the periphery (for 
$\rho = \rho_{\max}$). 
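
The weighting scheme and the weighted geometric mean can be sketched in Python as follows. The function names are hypothetical, the final thresholding at the fraction $t_3$ is left out for brevity, and the special case $\rho_{\max} = 0$ (all weights equal to 1) is an assumption added to avoid a division by zero.

```python
import math

def sigma_prime(rhos):
    """Standard deviation chosen so that the weight at rho_max equals 0.5."""
    rho_max = max(rhos)
    return math.sqrt(-rho_max ** 2 / (2 * math.log(0.5)))

def weights(rhos):
    """Weight of each tuple: 1 at rho = 0, falling to 0.5 at rho = rho_max."""
    if max(rhos) == 0:
        return [1.0] * len(rhos)           # assumption: degenerate case
    sp = sigma_prime(rhos)
    return [math.exp(-rho ** 2 / (2 * sp ** 2)) for rho in rhos]

def weighted_geometric_mean(values, ws):
    """AND-like combination: any zero input gives a zero output."""
    if any(v <= 0 for v in values):
        return 0.0
    log_sum = sum(w * math.log(v) for v, w in zip(values, ws))
    return math.exp(log_sum / sum(ws))
```

For three tuples that all lie at $\rho = 50$, this gives $\sigma' \approx 42.47$ and three weights of 0.5. A single missing feature, that is, one input equal to zero, suppresses the whole response, which is the AND-type behavior that distinguishes the geometric mean from a summation.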

Figure 4D shows the output of an S-COSFIRE filter which is 
defined as the weighted geometric mean of three blurred and 
shifted response images obtained by the three concerned V-COSFIRE 
filters. Note that this filter responds in the middle of 
a spatial arrangement of three vertices that is identical with or 
similar to that of the prototype shape $S$, which was used for 
the configuration of the S-COSFIRE filter. In this example, the 
S-COSFIRE filter reacts strongly at a given point that is surrounded 
by three vertices each having an aperture of $\pi/3$ radians: 
a northward-pointing, a south-west-pointing, and a 
south-east-pointing vertex to the north, south-west, and south-east 
of that point, respectively. Besides the complete triangle that 
was used for configuration, the concerned filter also detects the 
Kanizsa-type illusory triangle. This is in line with neurophysio- 
logical and psychophysical evidence, in that the visual system is 
capable of detecting a shape with illusory contours, based on its 
visible salient parts. A thorough review of this phenomenon is 
provided in Roelfsema (2006). 

2.4. TOLERANCE TO GEOMETRIC TRANSFORMATIONS 

The proposed S-COSFIRE filters are tolerant to rotation, scaling, 
and reflection. Similar to a V-COSFIRE filter, such a tolerance 
is achieved by manipulating the values of some parameters 
rather than by configuring separate filters by rotated, scaled, and 
reflected versions of the prototype shape of interest. 

2.5. TOLERANCE TO ROTATION 

Using the set $S_S$ that defines the concerned S-COSFIRE filter, 
we form a new set $\Re_\psi(S_S)$ that defines a new filter, which is 



Frontiers in Computational Neuroscience 



www.frontiersin.org 



July 2014 I Volume 8 | Article 80 | 4 



Azzopardi and Petkov 



Ventral-stream-like shape representation 



[Figure 4 panel labels: (A) Input image and prototype with S-COSFIRE structure; (B) V-COSFIRE filters and responses; (C) Blur and shift, giving blurred and shifted V-COSFIRE responses (blur $\sigma_i = 9.29$); (D) S-COSFIRE responses, with $\omega_1 = 0.5$, $\omega_2 = 0.5$, $\omega_3 = 0.5$, $\Omega = \omega_1 + \omega_2 + \omega_3 = 1.5$.]


FIGURE 4 | (A) Input image (of size 512 × 512 pixels). The enframed inlay 
images show (top) the enlarged prototype shape of interest, which is 
identical to the equilateral triangle in the input image and (bottom) the 
structure of the S-COSFIRE filter that is configured by this prototype. The 
ellipses illustrate the wavelengths and orientations of the Gabor filters that 
are used by the V-COSFIRE filters, and the dark blobs are intensity maps 
for blurring (Gaussian) functions. The blurred responses are then shifted by 
the corresponding vectors. (B) The V-COSFIRE filters that are 
automatically identified from the prototype shape and the corresponding 
response images to the input image. (C) We then blur (here we use 
$\sigma_0 = 0.1$ and $\alpha = 0.0853$ to compute $\sigma_i$) the thresholded (here at $t_1 = 0$) 
response $|r_{V_{f_i}}(x, y)|_{t_1}$ of each concerned V-COSFIRE filter and 
subsequently shift the resulting blurred response images by the corresponding 
polar-coordinate vectors $(\rho_i, \phi_i + \pi)$. (D) We use the weighted geometric mean 
(here $\sigma' = 91.44$) of all the blurred and shifted V-COSFIRE filter responses to 
compute (top) the output of the S-COSFIRE filter and show (bottom) the 
reconstruction of the detected features. The reconstruction is achieved by 
superimposing the Gabor filter responses that give input to the S-COSFIRE 
filter. The two local maxima in the output of the S-COSFIRE filter 
correspond to the triangle and to the perceived one in the input image. 
For better clarity we use inverted gray-level rendering to show the images 
in the right of the columns (B-D). 



selective for a version of the prototype shape $S$ that is rotated by 
an angle $\psi$: 

$$\Re_\psi(S_S) = \{ (\Re_\psi(V_{f_i}), \rho_i, \phi_i + \psi) \mid \forall\, (V_{f_i}, \rho_i, \phi_i) \in S_S \} \qquad (3)$$

For each tuple $(V_{f_i}, \rho_i, \phi_i)$ in the original filter $S_S$ that describes 
a certain vertex of the prototype shape, we provide a counterpart 
tuple $(\Re_\psi(V_{f_i}), \rho_i, \phi_i + \psi)$ in the new set $\Re_\psi(S_S)$. The set 
$\Re_\psi(V_{f_i})$ defines³ a V-COSFIRE filter that is selective for the vertex 
$f_i$ rotated by the same angle $\psi$. The orientation of the concerned 
vertex and its polar angle position $\phi_i$ with respect to the 
support center of the S-COSFIRE filter are offset by an angle $\psi$ 
relative to the values of the corresponding parameters of the 
original vertex. 

A rotation-invariant response is achieved by taking the maximum 
value of the responses of filters that are obtained with 
different values of the parameter $\psi$: 

$$\hat{r}_{S_S}(x, y) \overset{\mathrm{def}}{=} \max_{\psi \in \Psi} \{ r_{\Re_\psi(S_S)}(x, y) \} \qquad (4)$$

where $\Psi$ is a set of $n_\psi$ equidistant orientations defined as 
$\Psi = \{ \frac{2\pi}{n_\psi} i \mid 0 \le i < n_\psi \}$. 

³We refer to Azzopardi and Petkov (2013b) for the technical details about the 
invariance that is achieved by a V-COSFIRE filter. 

2.6. TOLERANCE TO SCALING 

Tolerance to scaling is achieved in a similar way. Using the set $S_S$ 
that defines the concerned S-COSFIRE filter, we form a new set 
$T_\upsilon(S_S)$ that defines a new filter, which is selective for a version of 
the prototype shape $S$ that is scaled in size by a factor $\upsilon$: 

$$T_\upsilon(S_S) = \{ (T_\upsilon(V_{f_i}), \upsilon\rho_i, \phi_i) \mid \forall\, (V_{f_i}, \rho_i, \phi_i) \in S_S \} \qquad (5)$$

For each tuple $(V_{f_i}, \rho_i, \phi_i)$ in the original S-COSFIRE filter $S_S$ 
that describes a certain vertex of the prototype shape, we provide 
a counterpart tuple $(T_\upsilon(V_{f_i}), \upsilon\rho_i, \phi_i)$ in the new set $T_\upsilon(S_S)$. 
The set $T_\upsilon(V_{f_i})$ defines³ a V-COSFIRE filter that responds to a 
version of the vertex $f_i$ scaled by the factor $\upsilon$. The size of the concerned 
vertex and its distance to the center of the filter are scaled 
by the factor $\upsilon$ relative to the original values of the corresponding 
parameters. 

A scale-invariant response is achieved by taking the maximum 
value of the responses of filters that are obtained with different 
values of the parameter $\upsilon$: 

$$\tilde{r}_{S_S}(x, y) \overset{\mathrm{def}}{=} \max_{\upsilon \in \Upsilon} \{ r_{T_\upsilon(S_S)}(x, y) \} \qquad (6)$$

where $\Upsilon$ is a set of $\upsilon$ values equidistant on a logarithmic scale 
defined as $\Upsilon = \{2^{i/2} \mid i \in \mathbb{Z}\}$. 

2.7. REFLECTION INVARIANCE 

As to reflection invariance, we first form a new set $\acute{S}_S$ from the set 
$S_S$ as follows: 

$$\acute{S}_S = \{ (\acute{V}_{f_i}, \rho_i, \pi - \phi_i) \mid \forall\, (V_{f_i}, \rho_i, \phi_i) \in S_S \} \qquad (7)$$

The set $\acute{V}_{f_i}$ defines³ a new V-COSFIRE filter that is selective for 
the corresponding vertex $f_i$ reflected about the y-axis. Similarly, 
the new S-COSFIRE filter $\acute{S}_S$ is selective for a reflected version 
of the prototype shape $S$, also about the y-axis. A reflection-invariant 
response is achieved by taking the maximum value of 
the responses of the filters $S_S$ and $\acute{S}_S$: 

$$\acute{r}_{S_S}(x, y) \overset{\mathrm{def}}{=} \max \{ r_{S_S}(x, y),\; r_{\acute{S}_S}(x, y) \} \qquad (8)$$

2.8. COMBINED TOLERANCE TO ROTATION, SCALING, AND 
REFLECTION 

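An S-COSFIRE filter achieves tolerance to all the above geometric 
transformations by taking the maximum value of the rotation- 
and scale-tolerant responses of the filters $S_S$ and $\acute{S}_S$ that are 
obtained with different values of the parameters $\psi$ and $\upsilon$: 

$$\bar{r}_{S_S}(x, y) \overset{\mathrm{def}}{=} \max_{\psi \in \Psi,\, \upsilon \in \Upsilon} \{ r_{\Re_\psi(T_\upsilon(S_S))}(x, y),\; r_{\Re_\psi(T_\upsilon(\acute{S}_S))}(x, y) \} \qquad (9)$$
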
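
The parameter manipulations behind the three tolerances can be sketched in Python as operations on the tuple set. This is a simplified illustration: all function names are hypothetical, the vertex-filter entry of each tuple is treated as an opaque label (in the method described above the vertex filter itself is also rotated, scaled, or reflected), and `response_fn` stands for any routine that evaluates a filter response for a given tuple set.

```python
import math

TWO_PI = 2 * math.pi

def rotate(tuples, psi):
    """Offset every polar angle by psi; distances are unchanged."""
    return [(f, rho, (phi + psi) % TWO_PI) for f, rho, phi in tuples]

def scale(tuples, upsilon):
    """Multiply every distance by upsilon; angles are unchanged."""
    return [(f, upsilon * rho, phi) for f, rho, phi in tuples]

def reflect(tuples):
    """Reflect about the y-axis: phi becomes pi - phi."""
    return [(f, rho, (math.pi - phi) % TWO_PI) for f, rho, phi in tuples]

def invariant_response(response_fn, tuples, psis, upsilons):
    """Maximum response over all rotated, scaled and reflected variants."""
    candidates = []
    for base in (tuples, reflect(tuples)):
        for psi in psis:
            for upsilon in upsilons:
                variant = rotate(scale(base, upsilon), psi)
                candidates.append(response_fn(variant))
    return max(candidates)
```

No additional configuration from rotated, scaled, or reflected example images is needed; each variant is derived from the single configured tuple set, at the cost of evaluating one response per variant.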
3. APPLICATIONS 

In the following we demonstrate the effectiveness of the proposed 
S-COSFIRE filters by applying them in two practical applications: 
the spotting of keywords in handwritten manuscripts and the 
spotting of objects in complex scenes for the computer vision 
system of a domestic robot. 

3.1. SPOTTING KEYWORDS IN HANDWRITTEN MANUSCRIPTS 

The automatic recognition of keywords in handwritten 
manuscripts is an application that has been extensively investigated 
for several decades (Plamondon and Srihari, 2000; Frinken 
et al., 2012). Despite this effort the problem has not been 
solved yet. 

As a demonstration, in Figure 5 we show how to detect the 
keyword "Germany" in two handwritten manuscripts. We use 
the keyword prototype "Germany" that is shown enframed in 
Figure 5A to configure an S-COSFIRE filter that receives input 
from 13 V-COSFIRE filters (Figure 5E). Figures 5C,D show the 
responses of the concerned S-COSFIRE filter ($t_1 = 0.1$, $t_2 = 0.75$, 
$t_3 = 0.1$, $\sigma_0 = 0.67$, and $\alpha = 0.1$) to the two manuscript images⁴ 
in Figures 5A,B. It spots all six instances of the keyword 
"Germany" and does not produce any false positives. 

The S-COSFIRE filters that are selective for specific words may 
correspond to neurons or networks of neurons in a certain area 
in the posterior lateral-occipital cortex. This area receives input 
from V4 and is selective for combinations of vertices. It has been 
shown to play a role in the recognition of words and has been 
named Visual Word Form Area (Szwed et al., 2011). 

3.2. VISION FOR A HOME TIDYING PICKUP ROBOT 

Daily service robots that perform routine tasks are becoming pop- 
ular as household appliances. Such tedious tasks include, but are 
not limited to, vacuum cleaning, setting up and cleaning up a din- 
ner table, tidying up toys, and organizing closets. The design of 
domestic robots is a growing research area (Bandera et al., 2012; 
Jiang et al., 2012). 



⁴The images in Figures 5A,B are extracted from the files named b01-049.png 
and b01-044.png, respectively, in the IAM offline database. 









FIGURE 5 | An example of spotting the keyword "Germany" in (A,B) two 
handwritten manuscripts taken from the IAM offline database (Marti 
and Bunke, 2002). The indicated keyword "Germany" in (A) is used as a 
prototype to configure an S-COSFIRE filter. (E) The circles indicate the 
support areas of 13 V-COSFIRE filters that are used to provide input to the 
concerned S-COSFIRE filter with the "+" marker indicating its support 
center. (C,D) Normalized responses of this filter to the images in (A,B) 
rendered by shading of the spotted words. All six instances are detected. The 
strongest response is achieved for the word that was used for the 
configuration of the S-COSFIRE filter. 

We demonstrate how the S-COSFIRE filters that we propose 
can be used by a personal robot to visually recognize objects of 
interest in indoor environments. As an illustration we consider 
a task for a tidying pickup robot to detect shoes in different 
rooms of a home that match the prototype shoe shown in 
Figure 6A. 

We use a segmented prototype image of the shoe to configure
an S-COSFIRE filter. The concerned S-COSFIRE filter receives
input from three V-COSFIRE filters that are selective for different
parts of the shoe. These parts are automatically chosen by the system
from a circular local neighborhood of a point of interest that
is indicated by a "+" marker. In practice, the concerned point of
interest and the radius of the corresponding local neighborhood
are manually specified by the user. The radii of the three circles are
automatically computed in such a way that the circles touch each
other. For the configuration of the concerned V-COSFIRE filters
we use a bank of Gabor energy filters^ with one wavelength (λ = 4)
and 16 equidistant orientations (θ ∈ {πi/8 | i = 0 . . . 15}), and we
threshold the responses with t1 = 0.3. Within each of the three
circles, we consider a number of concentric circles, the radii of
which increment in intervals of 4 pixels starting from 0. For the
concerned three V-COSFIRE filters as well as the S-COSFIRE filter
we use the same values of parameters α (α = 0.67) and σ0
(σ0 = 0.1) in order to allow the same tolerance in the position of
the involved edges and curvatures.
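
The parameter choices above can be collected in a short sketch. It is illustrative only: the function names are hypothetical, the Gabor kernels themselves are not implemented, and the thresholding rule (keep responses of at least a fraction t1 of the maximum) is an assumption, since the text only states the value of the threshold.

```python
import math

WAVELENGTH = 4.0      # one wavelength, as stated in the text
N_ORIENTATIONS = 16   # 16 equidistant orientations
T1 = 0.3              # threshold applied to the Gabor energy responses
RADIUS_STEP = 4       # concentric circles every 4 pixels, starting from 0


def orientations(n=N_ORIENTATIONS):
    """Equidistant orientations pi*i/8 for i = 0..15 (steps of pi/8)."""
    return [2.0 * math.pi * i / n for i in range(n)]


def gabor_energy(symmetric, antisymmetric):
    """L2-norm of the symmetric and antisymmetric Gabor responses."""
    return math.hypot(symmetric, antisymmetric)


def threshold(responses, t=T1):
    """Assumption: responses below the fraction t of the maximum are set to 0."""
    peak = max(responses)
    return [r if r >= t * peak else 0.0 for r in responses]


def concentric_radii(support_radius, step=RADIUS_STEP):
    """Radii of the concentric circles considered inside one support circle."""
    return list(range(0, support_radius + 1, step))
```

For a support circle of radius 10 pixels, `concentric_radii(10)` yields the radii 0, 4 and 8.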

^The response of a Gabor energy filter is computed as the L2-norm of the
responses of a symmetric and an antisymmetric Gabor filter.

We created a data set that we call RUG-Shoes of 60 color
images (of size 256 × 342 pixels) by taking pictures in different
rooms of the same house. Of these images, 39 contain a
pair of shoes of interest, another nine contain a single shoe
and the remaining 12 do not contain any shoes. The distance
above ground of the digital camera was varied between 50 cm
and 1 m. All pictures of shoes were taken from the side view of
the corresponding shoes. The shoes were, however, arranged in
different orientations and their distances from the camera varied
by at most 25% as compared to the distance which we used to
take the image of the prototype shoe. We made the RUG-Shoes
data set publicly available^.

^The RUG-Shoes data set can be downloaded from
http://matlabserver.cs.rug.nl/

We use the configured S-COSFIRE filter to detect shoes in
the data set of 60 images. We first convert every color image
to grayscale and subsequently apply the concerned S-COSFIRE
filter in reflection-, scale- (υ ∈ {[…], 1, […]}) and partially
rotation-invariant (ψ ∈ {−[…], 0, […]}) mode. The Gabor energy filters that
we use to provide inputs to the V-COSFIRE filters are applied
with isotropic suppression (Grigorescu et al., 2004) in order to
reduce responses to texture. We threshold the responses of the
concerned S-COSFIRE filter with t3 = 0.1 and for each image we
consider only the highest two responses. We obtain a perfect
detection and recognition performance for all the 60 images in
the RUG-Shoes data set. This means that we detect all the shoes
in the given images with no false positives. Figure 6B illustrates
the detection of some shoes in two of the images.

FIGURE 6 | Detection of shoes in complex scenes. (A) A prototype shoe
used for the configuration of an S-COSFIRE filter. The circles represent the
non-overlapping supports of three V-COSFIRE filters, and the "+" marker
indicates the center of support of the concerned S-COSFIRE filter. (Top right)
The superimposed (inverted) thresholded responses (t1 = 0.3) of a bank of
Gabor energy filters with one wavelength (λ = 4) and 16 orientations in intervals
of π/8. (Bottom) Reconstructions of the local patterns for which the three
resulting V-COSFIRE filters are selective. (B) Detection results to two input
images (of size 256 × 342 pixels) from the RUG-Shoes data set with filenames
(a) Shoes03_1.jpg, (c) Shoes17_2.jpg, (e) Shoes58_2.jpg, and (g) Shoes38_1.jpg.

4. DISCUSSION

The trainable S-COSFIRE filters that we propose are part of a hierarchical
object recognition approach that shares similarities with
the ventral stream of visual cortex. In the first layer we detect lines
and edges by Gabor filters, which are inspired by the function
of orientation-selective cells in primary visual cortex (Daugman,
1985). Their responses are projected to a second layer and used
by V-COSFIRE filters that detect vertices and curved contour
segments. In our previous work (Azzopardi and Petkov, 2013b),
we showed that such filters give responses that are qualitatively
similar to a class of cells in area V4 in visual cortex. Finally,
in a third layer we have S-COSFIRE filters that combine the
responses of certain V-COSFIRE filters. Such a filter is selective
for a given spatial configuration of vertices and curved contour
segments that defines a simple to moderately complex shape.
S-COSFIRE filters share similar properties with shape-selective
neurons in inferotemporal cortex, which provided inspiration for
this work.

This hierarchical object recognition approach is, however, not
restricted to three layers. The addition of further layers may be
more appropriate for prototype objects of higher deformation
complexity. For instance, let us consider a prototype shape of a
simplistic human-body figure that is composed of a head, a pair of
eyes, a nose, a mouth, two arms, two hands, a torso, two legs, and
two feet. We may configure an S-COSFIRE filter to be selective for
the entire body with its center being at the center of mass of the
body. Such a filter receives input from V-COSFIRE filters that are
selective for distinct body parts. With this type of configuration
the tolerance in the position of the body parts is computed with
the same function that depends on the distance from the center of
the S-COSFIRE filter. However, we know that certain body parts
may require more tolerance or may be more correlated than others.
For instance, the positions of the eyes, the nose and the mouth
depend more on the position of the head than on the position of
the legs. Taking this aspect into consideration, it would be better
to construct a hierarchical filter in the following way: configure
an S-COSFIRE filter to be selective for the spatial arrangement of
the head components (eyes, nose, and mouth), an S-COSFIRE filter
for a hand and an arm, another one for a foot and a leg and
a fourth one for the torso. Then, the responses of these four
S-COSFIRE filters may be used as inputs to another, more complex
S-COSFIRE filter.

The configuration of an S-COSFIRE filter determines which
responses of which V-COSFIRE filters need to be multiplied
in order to obtain the output of the filter. The number of
V-COSFIRE filters used is a model parameter that is specified by
the user. This value depends on the shape complexity of the
concerned prototype (as represented by the number of vertex
features). The selectivity of an S-COSFIRE filter increases with
an increasing number of V-COSFIRE filters. The sizes of the
V-COSFIRE supports and their positions are automatically determined
in such a way that they do not overlap each other. In future
work, we will incorporate a learning mechanism in the configuration
stage. It will use multiple prototype examples of the object
of interest (instead of only one prototype that we use here) and
negative examples (e.g., other objects and scenes). It will learn
the optimal number of V-COSFIRE filters as well as the size and
position of their support in order to maximize selectivity and
generalization abilities.

An S-COSFIRE filter achieves a response when all parts of
a shape of interest are present in a specific spatial arrangement
around a given point in an image. The rigidity of this geometrical
configuration may vary according to the application at hand.
The standard deviation of a blurring (Gaussian) function that we
use to allow for some tolerance depends on the distance from the
center of the concerned S-COSFIRE filter: it grows linearly with a
rate that is defined by the parameter α. Small values of α are more
appropriate for the selective detection of rigid objects. Generalization
ability increases with an increasing value of α. This mechanism is
inspired by neurophysiological evidence that the average diameter
of receptive fields of some neurons in visual cortex increases with
the eccentricity (Gattass et al., 1988).
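
The linear growth of the positional tolerance can be made concrete with a short sketch. It assumes that the standard deviation has the form sigma = sigma0 + alpha * rho, where rho is the distance from the filter center; the text states only that the growth is linear with a rate given by the parameter, so the constant offset sigma0, the function name, and the example values below are assumptions chosen for illustration.

```python
def blur_sigma(rho, sigma0, alpha):
    """Standard deviation of the Gaussian blur at distance rho from the center.

    Assumed form: a constant offset sigma0 plus linear growth with rate alpha.
    """
    if rho < 0:
        raise ValueError("rho is a distance and must be non-negative")
    return sigma0 + alpha * rho


# Illustrative values only: a small rate keeps the configuration rigid,
# a larger rate tolerates more displacement of parts far from the center.
rigid = [blur_sigma(rho, sigma0=1.0, alpha=0.1) for rho in (0.0, 10.0, 20.0)]
tolerant = [blur_sigma(rho, sigma0=1.0, alpha=0.5) for rho in (0.0, 10.0, 20.0)]
```

At the center both settings give the same tolerance; the difference between them grows with the distance of a part from the center.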

The specific type of function that we use to combine the
responses of constituent (V-COSFIRE) filters for the considered
applications is a weighted geometric mean. This output function,
which is also used to compute a V-COSFIRE filter response,
proved to give better results than various forms of addition.
Furthermore, there is psychophysical evidence that human visual
processing of shape is likely performed by a non-linear neural
operation that multiplies afferent responses (Gheorghiu and
Kingdom, 2009). In future work, we plan to experiment with
functions other than (weighted) geometric mean.
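
A minimal sketch of a weighted geometric mean may help to show why multiplication behaves differently from addition. The weights are passed in as plain numbers; how they are derived is not specified here, and the function name is hypothetical.

```python
import math


def weighted_geometric_mean(responses, weights):
    """Weighted geometric mean of non-negative responses.

    Returns 0 as soon as one input with positive weight is 0, so the
    output requires every contributing part to be present.
    """
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    active = [(r, w) for r, w in zip(responses, weights) if w > 0]
    if any(r <= 0 for r, _ in active):
        return 0.0
    log_sum = sum(w * math.log(r) for r, w in active)
    return math.exp(log_sum / sum(weights))
```

With equal weights, the inputs 4 and 9 give a value of 6 (up to rounding), whereas the inputs 0 and 9 give 0. An arithmetic mean of 0 and 9 would still be 4.5, that is, it would respond even though one part is missing.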

The application of the home tidying robot in section 3.2
demonstrates the benefits of the rotation, scale and reflection
invariances that we use. A single S-COSFIRE filter that is configured
with a single prototype is able to achieve responses
to different views of the object used for training. While this ability
implies more operations, the computational cost does not grow
linearly with the number of considered views. This is attributable
to the fact that the responses of the bank of Gabor filters at the
bottom layer can be shared among the involved V-COSFIRE filters,
irrespective of the view. We refer the reader to Azzopardi and
Petkov (2013a,b) for the technical details. The majority of the new
operations required due to the invariances are shifting computations,
which have very low computational cost. In practice, the
shoe-selective filter used in section 3.2 takes 3.5 s to process an
image (256 × 342 pixels) with no invariances, and less than 5 s
with rotation-, scale-, and reflection-invariance.
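
The following sketch illustrates why invariance adds mostly shift computations. It is a simplified picture under stated assumptions, not the method as specified in the cited papers: each input to the filter is described by a hypothetical polar position (rho, phi) relative to the filter center, invariance is obtained by taking the maximum response over transformed copies of these positions, and the sign of the shift depends on the image coordinate convention. All function names and the example scale and rotation values in the usage note are hypothetical.

```python
import math
from itertools import product


def transform_tuples(tuples, scale=1.0, rotation=0.0, reflect=False):
    """Return (rho, phi) positions after optional reflection, rotation and scaling."""
    out = []
    for rho, phi in tuples:
        if reflect:
            phi = math.pi - phi  # mirror about the vertical axis
        out.append((rho * scale, (phi + rotation) % (2.0 * math.pi)))
    return out


def shift_vectors(tuples):
    """Shift (dx, dy) that moves the response of each part to the filter center."""
    return [(-rho * math.cos(phi), -rho * math.sin(phi)) for rho, phi in tuples]


def invariant_response(response_fn, tuples, scales, rotations):
    """Maximum of response_fn over all scale, rotation and reflection variants."""
    variants = product(scales, rotations, (False, True))
    return max(response_fn(transform_tuples(tuples, s, r, m)) for s, r, m in variants)
```

With three scales, three rotations and two reflection states, `invariant_response` evaluates 18 variants. In this simplified picture only the shift vectors differ between variants, so the responses of the lower layers can be reused.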

The proposed S-COSFIRE filters are particularly useful due to
their versatility and selectivity: an S-COSFIRE filter can
be configured to be selective for any given deformable object and
then used to detect perceptually similar objects that are embedded
in complex scenes. This effectiveness is attributable to
taking into account the mutual spatial positions of the responses
of certain V-COSFIRE filters that are selective for simpler object
parts.

5. CONCLUSIONS 

The S-COSFIRE filters that we propose are highly effective at
detecting and recognizing deformable objects that are embedded in
complex scenes without prior segmentation. This effectiveness is
due to the use of both the presence of certain object-characteristic
features and their mutual spatial arrangement. They
are versatile shape detectors as they can be trained to be selective
for any given visual pattern of interest.

An S-COSFIRE filter is conceptually simple and easy to implement:
the filter output is computed as the weighted geometric
mean of blurred and shifted responses of simpler V-COSFIRE
filters.